Joins

 

Rodney Dyer, PhD

Topics on Joins

  • Rationale
  • Common Joins
  • Filtering Joins
  • Spatial Joins
  • Airport Dataset

Rationale

Multi-Table Data

Rarely do we keep all but the most simple of data in a single table.

Table Structure

Consider the two tables to the right.

  • Common column labeled Key

  • Other data columns (only 1 in each for brevity)

Example data table structure

Varieties of Keys 🗝

Primary Key

A primary key is a column in a table that uniquely identifies a single row. There MUST be a unique identifier to be a Primary Key.

people <- data.frame( Key = c("A","B","C"),
                      Name = c("Bob","Alice","Mary"),
                      Major = c("ENVS", "ENVS", "BIOL") )
people %>%
  count( Key ) 
  Key n
1   A 1
2   B 1
3   C 1

Foreign Key

A foreign key is one that references a primary key in another table.

homework <- data.frame( Key = c("A","B","A","D","B"), 
                        Grade = rpois(5,lambda = 100) )
homework
  Key Grade
1   A   117
2   B    97
3   A    86
4   D   102
5   B    93

Here the Key column is referencing a unique row of data in the people data table.

Taxonomy of Joins

Taxonomy of Joins

We can combine the data in these tables in several different ways based upon what we are looking for. When thinking of joins, we must think about the how we want to select the overlapping sets of keys in both data.frames.

  • Full Join (aka outer join)

  • Left Join

  • Right Join

  • Inner Join

‘Left’ & ‘Right’ Tables

Example data table structure

Full/Outer Join

The outer join has all the data from both left & right tables. All keys are present in the result.

Outer Join

All homework and people data.

people %>%
  full_join( homework, by="Key" )
  Key  Name Major Grade
1   A   Bob  ENVS   117
2   A   Bob  ENVS    86
3   B Alice  ENVS    97
4   B Alice  ENVS    93
5   C  Mary  BIOL    NA
6   D  <NA>  <NA>   102

Left Join

The left join is one where the result has all the keys from the left but only those in the right one that are in the left.

people %>% left_join( homework, by="Key")
  Key  Name Major Grade
1   A   Bob  ENVS   117
2   A   Bob  ENVS    86
3   B Alice  ENVS    97
4   B Alice  ENVS    93
5   C  Mary  BIOL    NA

Right Join

The Right join results in all the keys from the right data table and the matching ones from the left.

people %>% right_join( homework, by="Key")
  Key  Name Major Grade
1   A   Bob  ENVS   117
2   A   Bob  ENVS    86
3   B Alice  ENVS    97
4   B Alice  ENVS    93
5   D  <NA>  <NA>   102

Inner Joins

Inner joins result in the intersection of keys.

people %>% inner_join( homework, by="Key")
  Key  Name Major Grade
1   A   Bob  ENVS   117
2   A   Bob  ENVS    86
3   B Alice  ENVS    97
4   B Alice  ENVS    93

Filtering Joins

Filtering via Join

There are times when the results you are looking for should be based upon items in another data.frame object but should not contain information from that second table.

The Semi Join

We can also use joins to filter values within one data.frame. Here the semi_join() keeps everything in the left data that has a key in the right one, but importantly it does not import the right data columns into the result.

people %>% 
  semi_join( homework, by="Key")
  Key  Name Major
1   A   Bob  ENVS
2   B Alice  ENVS

The Anti Join

The opposite of the semi_join() is the anti_join() which drops everything in the left table that has a key in the right one, leaving only the ones that are unique.

people %>%
  anti_join( homework, by = "Key")
  Key Name Major
1   C Mary  BIOL

Spatial Joins

Spatial Joins - Polygons

 

  • Imported and unzipped shape files (behind the scenes)
  • RVA Zoning District Polygons
  • Contained in data.frame object
  • Each row is a spatial record.
summary( theFan )
    OBJECTID          Name            Ordinance          OrdinanceP       
 Min.   :   5.0   Length:91          Length:91          Length:91         
 1st Qu.: 270.5   Class :character   Class :character   Class :character  
 Median : 377.0   Mode  :character   Mode  :character   Mode  :character  
 Mean   : 498.3                                                           
 3rd Qu.: 505.0                                                           
 Max.   :1872.0                                                           
  Conditiona          AdoptionDa           Comment           CreatedBy        
 Length:91          Min.   :2000-01-01   Length:91          Length:91         
 Class :character   1st Qu.:2000-01-01   Class :character   Class :character  
 Mode  :character   Median :2001-10-08   Mode  :character   Mode  :character  
                    Mean   :2005-06-16                                        
                    3rd Qu.:2008-02-18                                        
                    Max.   :2020-02-24                                        
   CreatedDat            EditBy             EditDate         
 Min.   :2020-08-24   Length:91          Min.   :2020-08-24  
 1st Qu.:2020-08-24   Class :character   1st Qu.:2020-08-24  
 Median :2020-08-24   Mode  :character   Median :2020-08-24  
 Mean   :2020-08-24                      Mean   :2020-08-24  
 3rd Qu.:2020-08-24                      3rd Qu.:2020-08-24  
 Max.   :2020-08-24                      Max.   :2020-08-24  
   GlobalID           Shape__Are         Shape__Len             geometry 
 Length:91          Min.   :    4466   Min.   :  371   MULTIPOLYGON : 4  
 Class :character   1st Qu.:   84963   1st Qu.: 1388   POLYGON      :87  
 Mode  :character   Median :  270545   Median : 2823   epsg:2284    : 0  
                    Mean   : 2697532   Mean   : 6941   +proj=lcc ...: 0  
                    3rd Qu.: 1653270   3rd Qu.: 8281                     
                    Max.   :65772905   Max.   :69888                     

Spatial Joins - Polygons

names( theFan )
 [1] "OBJECTID"   "Name"       "Ordinance"  "OrdinanceP" "Conditiona"
 [6] "AdoptionDa" "Comment"    "CreatedBy"  "CreatedDat" "EditBy"    
[11] "EditDate"   "GlobalID"   "Shape__Are" "Shape__Len" "geometry"  
theFan[1,]
Simple feature collection with 1 feature and 14 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 11781130 ymin: 3727746 xmax: 11781350 ymax: 3728014
Projected CRS: NAD83 / Virginia South (ftUS)
  OBJECTID Name Ordinance OrdinanceP Conditiona AdoptionDa Comment
5        5  B-1      <NA>       <NA>         No 2000-01-01    <NA>
           CreatedBy CreatedDat             EditBy   EditDate
5 richard.morton_cor 2020-08-24 richard.morton_cor 2020-08-24
                              GlobalID Shape__Are Shape__Len
5 f1305477-4e71-463c-a202-332971d8c5e1    33261.3   762.5714
                        geometry
5 POLYGON ((11781140 3727834,...

Spatial Joins

plot( theFan["Name"], main="")

Spatial Joins - Vector Data

 

  • Road systems in RVA
  • Another shapefile folder
summary( fanRoads )
      FID          Carriagewa       AssetID           StreetType       
 Min.   :29123   Min.   : 16873   Length:2120        Length:2120       
 1st Qu.:33102   1st Qu.: 50429   Class :character   Class :character  
 Median :37238   Median : 72438   Mode  :character   Mode  :character  
 Mean   :38715   Mean   : 63099                                        
 3rd Qu.:43488   3rd Qu.: 74663                                        
 Max.   :57927   Max.   :139427                                        
                                                                       
  Functional            FIPS            LeftFromAd         LeftToAddr       
 Length:2120        Length:2120        Length:2120        Length:2120       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  RightFromA         RightToAdd         PrefixDire         ProperName       
 Length:2120        Length:2120        Length:2120        Length:2120       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  SuffixType         SuffixDire          FullName          RouteName        
 Length:2120        Length:2120        Length:2120        Length:2120       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    OneWay            PostedSpee     CADRouteSp     CreatedBy        
 Length:2120        Min.   : 0.0   Min.   : 5.00   Length:2120       
 Class :character   1st Qu.:25.0   1st Qu.:25.00   Class :character  
 Mode  :character   Median :25.0   Median :25.00   Mode  :character  
                    Mean   :25.9   Mean   :21.53                     
                    3rd Qu.:25.0   3rd Qu.:25.00                     
                    Max.   :55.0   Max.   :85.00                     
                    NA's   :465                                      
   CreatedDat            EditBy             EditDate         
 Min.   :2020-05-20   Length:2120        Min.   :2020-05-20  
 1st Qu.:2020-05-20   Class :character   1st Qu.:2020-05-20  
 Median :2020-05-20   Mode  :character   Median :2020-05-20  
 Mean   :2020-05-20                      Mean   :2020-05-20  
 3rd Qu.:2020-05-20                      3rd Qu.:2020-05-20  
 Max.   :2020-05-20                      Max.   :2020-05-20  
                                                             
   GlobalID           SHAPE_Leng                 geometry   
 Length:2120        Min.   :  13.64   LINESTRING     :2119  
 Class :character   1st Qu.: 135.81   MULTILINESTRING:   1  
 Mode  :character   Median : 189.10   epsg:2284      :   0  
                    Mean   : 262.39   +proj=lcc ...  :   0  
                    3rd Qu.: 373.35                         
                    Max.   :7872.64                         
                                                            

Spatial Joins

plot( st_geometry( fanRoads ) )

Cleaning Up Spatial Data Frames

fanRoads %>%
  select( FullName, OneWay, StreetType,
                 SpeedLimit = PostedSpee, Length = SHAPE_Leng,
                 geometry) %>%
  mutate( OneWay = factor( OneWay ),
          StreetType = factor( StreetType) ) -> fanRoads
summary( fanRoads )
   FullName          OneWay         StreetType     SpeedLimit  
 Length:2120        FT  : 329   Alley    : 480   Min.   : 0.0  
 Class :character   TF  : 352   Artery   : 380   1st Qu.:25.0  
 Mode  :character   NA's:1439   Highway  :  18   Median :25.0  
                                Private  :   8   Mean   :25.9  
                                Ramp     :  11   3rd Qu.:25.0  
                                Secondary:1223   Max.   :55.0  
                                                 NA's   :465   
     Length                   geometry   
 Min.   :  13.64   LINESTRING     :2119  
 1st Qu.: 135.81   MULTILINESTRING:   1  
 Median : 189.10   epsg:2284      :   0  
 Mean   : 262.39   +proj=lcc ...  :   0  
 3rd Qu.: 373.35                         
 Max.   :7872.64                         
                                         

Spatial Join

target <- theFan[ theFan$OBJECTID == 368, ]
plot( st_geometry(target ) ) 

Spatial Join - st_intersects()

fanRoads %>%
  filter( st_intersects( fanRoads, target, 
                         sparse = FALSE ) == TRUE ) %>%
  st_drop_geometry() %>%
  select( `Street Name` = FullName ) %>%
  arrange( `Street Name` ) %>% 
  unique()
           Street Name
1           Allison St
8             Birch St
12             Boyd St
17           Floyd Ave
34           Grove Ave
49         Hanover Ave
62      Kensington Ave
65         Lombardy Pl
68       Madumbie Lane
69        Monument Ave
96         N Allen Ave
119 N Arthur Ashe Blvd
121     N Brunswick St
122        N Davis Ave
152        N Granby St
161      N Harrison St
169        N Harvie St
175      N Lombardy St
186        N Meadow St
202        N Morris St
208      N Mulberry St
213          N Plum St
220      N Robinson St
237       N Rowland St
252      N Shields Ave
269     N Stafford Ave
283          N Vine St
294           Park Ave
317          Ryland St
318          S Plum St
319     S Stafford Ave
320      Strawberry St
342         Stuart Ave
350         Stuart Cir
363      Trouvaille Al
366      W Franklin St
374          W Main St
380           West Ave
384               <NA>

Airport Dataset

The Library

Four data.frames are contained within the library representing all the flights from NYC airports in 2013 (you may need to install this package).

library( nycflights13 )
  • airlines

  • airports

  • planes

  • weather

  • flights

The Coverage

Airlines

head(airlines)
# A tibble: 6 × 2
  carrier name                    
  <chr>   <chr>                   
1 9E      Endeavor Air Inc.       
2 AA      American Airlines Inc.  
3 AS      Alaska Airlines Inc.    
4 B6      JetBlue Airways         
5 DL      Delta Air Lines Inc.    
6 EV      ExpressJet Airlines Inc.

Airports

head( airports )
# A tibble: 6 × 8
  faa   name                             lat   lon   alt    tz dst   tzone      
  <chr> <chr>                          <dbl> <dbl> <dbl> <dbl> <chr> <chr>      
1 04G   Lansdowne Airport               41.1 -80.6  1044    -5 A     America/Ne…
2 06A   Moton Field Municipal Airport   32.5 -85.7   264    -6 A     America/Ch…
3 06C   Schaumburg Regional             42.0 -88.1   801    -6 A     America/Ch…
4 06N   Randall Airport                 41.4 -74.4   523    -5 A     America/Ne…
5 09J   Jekyll Island Airport           31.1 -81.4    11    -5 A     America/Ne…
6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593    -5 A     America/Ne…

Planes

head( planes )
# A tibble: 6 × 9
  tailnum  year type                    manuf…¹ model engines seats speed engine
  <chr>   <int> <chr>                   <chr>   <chr>   <int> <int> <int> <chr> 
1 N10156   2004 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo…
2 N102UW   1998 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
3 N103US   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
4 N104UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
5 N10575   2002 Fixed wing multi engine EMBRAER EMB-…       2    55    NA Turbo…
6 N105UW   1999 Fixed wing multi engine AIRBUS… A320…       2   182    NA Turbo…
# … with abbreviated variable name ¹​manufacturer

Weather

head(weather)
# A tibble: 6 × 15
  origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust
  <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>
1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA
2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA
3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA
4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA
5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA
6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA
# … with 4 more variables: precip <dbl>, pressure <dbl>, visib <dbl>,
#   time_hour <dttm>
# ℹ Use `colnames()` to see all variable names

Flights

This is the main data.frame.

names( flights )
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.

Peter Sellers looking bored